HAL - Université de Franche-Comté

CiteSeerX

HAL-IN2P3

Hal - Université Grenoble Alpes

arXiv.org e-Print Archive

An Efficient Multiway Mergesort for GPU Architectures

Author: Casanova Henri
Iacono John
Karsin Ben
Sitchinava Nodari
Weichert Volker
Publication venue
Publication date: 01/01/2017
Field of study

Sorting is a primitive operation that is a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures as they provide massive parallelism and computing power. However, the intricacies of their compute and memory hierarchies make designing GPU-efficient algorithms challenging. In this work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway mergesort algorithm. MMS employs a new partitioning technique that exposes the parallelism needed by modern GPU architectures. To the best of our knowledge, MMS is the first sorting algorithm for the GPU that is asymptotically optimal in terms of global memory accesses and that is completely free of shared memory bank conflicts. We realize an initial implementation of MMS, evaluate its performance on three modern GPU architectures, and compare it to competitive implementations available in state-of-the-art GPU libraries. Despite these implementations being highly optimized, MMS compares favorably, achieving performance improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art algorithms are susceptible to bank conflicts. We find that for certain inputs that cause these algorithms to incur large numbers of bank conflicts, MMS can achieve up to a 37.6% speedup over its fastest competitor. Overall, even though its current implementation is not fully optimized, due to its efficient use of the memory hierarchy, MMS outperforms the fastest comparison-based sorting implementations available to date.
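
For readers unfamiliar with the term, the following is a minimal CPU-side sketch of a generic multiway (k-way) mergesort in Python. It is not the MMS algorithm: it does not model the GPU partitioning technique, global memory traffic, or shared memory banks described in the abstract. The function name `multiway_mergesort` and the parameters `k` and `base` are hypothetical, chosen only for illustration.

```python
import heapq


def multiway_mergesort(items, k=4, base=8):
    """Sort a list by splitting it into at most k runs and k-way merging them.

    Illustrative only: k (fan-out) and base (cutoff for the direct sort)
    are assumed parameters, not values taken from the paper.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(items)
    # Small inputs are sorted directly; max(..., 1) guarantees termination.
    if n <= max(base, 1):
        return sorted(items)
    # Split into contiguous runs of near-equal length (ceiling division).
    step = -(-n // k)
    runs = [multiway_mergesort(items[i:i + step], k, base)
            for i in range(0, n, step)]
    # heapq.merge lazily merges any number of already-sorted inputs.
    return list(heapq.merge(*runs))
```

With a larger `k`, the recursion has fewer levels, so each element is read and written fewer times overall. That is the general motivation for multiway merging on memory-bound hardware.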

DI-fusion

Computing the expected makespan of task graphs in the presence of silent errors

Author: Casanova Henri
Herrmann Julien
Robert Yves
Publication venue: HAL CCSD
Publication date: 16/08/2016
Field of study

Applications structured as Directed Acyclic Graphs (DAGs) of tasks correspond to a general model of parallel computation that occurs in many domains, including popular scientific workflows. DAG scheduling has received an enormous amount of attention, and several list-scheduling heuristics have been proposed and shown to be effective in practice. Many of these heuristics make scheduling decisions based on path lengths in the DAG. At large scale, however, compute platforms and thus tasks are subject to various types of failures with no longer negligible probabilities of occurrence. Failures that have recently received increasing attention are "silent errors," which cause a task to produce incorrect results even though it ran to completion. Tolerating silent errors is done by checking the validity of the results and re-executing the task from scratch in case of an invalid result. The execution time of a task then becomes a random variable, and so do path lengths. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics is preconditioned on computing accurate approximations of the expected makespan. In this work we propose an algorithm that computes a first-order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We compare our proposed approximation to previously proposed approximations for three classes of application graphs from the field of numerical linear algebra. Our evaluations quantify approximation error with respect to a ground truth computed via a brute-force Monte Carlo method. We find that our proposed approximation outperforms previously proposed approaches, leading to large reductions in approximation error for low (and realistic) failure rates, while executing much faster.
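
The brute-force Monte Carlo ground truth mentioned in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' code: every execution attempt fails independently with the same probability `p_fail`, verification is free, processors are unlimited (so the makespan is the longest path), and tasks are supplied in topological order. All names (`sample_makespan`, `expected_makespan`, `weights`, `preds`) are hypothetical.

```python
import random


def sample_makespan(weights, preds, p_fail, rng):
    """Draw one makespan sample; weights must list tasks in topological order."""
    finish = {}
    for task, w in weights.items():
        # Assumption: re-execute from scratch until an attempt succeeds.
        attempts = 1
        while rng.random() < p_fail:
            attempts += 1
        # A task starts once all of its predecessors have finished.
        start = max((finish[q] for q in preds.get(task, ())), default=0.0)
        finish[task] = start + attempts * w
    return max(finish.values(), default=0.0)


def expected_makespan(weights, preds, p_fail, samples=10000, seed=0):
    """Average many samples to estimate the expected makespan."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    rng = random.Random(seed)
    total = sum(sample_makespan(weights, preds, p_fail, rng)
                for _ in range(samples))
    return total / samples


# Diamond-shaped DAG: a -> {b, c} -> d.
weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
estimate = expected_makespan(weights, preds, p_fail=0.05)
```

With `p_fail=0.0` the estimate reduces to the deterministic longest path (5.0 for the diamond above). The cost of this approach grows with the number of samples, which is why faster analytical approximations are of interest.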

Checkpointing vs. Migration for Post-Petascale Machines

Author: Cappello Franck
Casanova Henri
Robert Yves
Publication venue: HAL CCSD
Publication date: 01/01/2009
Field of study

We craft a few scenarios for the execution of sequential and parallel jobs on future-generation machines. Checkpointing or migration: which technique to choose?

arXiv.org e-Print Archive

HAL-CentraleSupelec

Hal - Université Grenoble Alpes

HAL-Rennes 1

From Simulation to Experiment: A Case Study on Multiprocessor Task Scheduling

Author: Casanova Henri
Hunold Sascha
Suter Frederic
Publication venue: HAL CCSD
Publication date: 16/05/2011
Field of study

Simulation is a popular approach for empirically evaluating the performance of algorithms and applications in the parallel computing domain. Most published works present results without quantifying simulation error. In this work we investigate accuracy issues when simulating the execution of parallel applications. This is a broad question, and we focus on a relevant case study: the evaluation of scheduling algorithms for executing mixed-parallel applications on clusters. Most such scheduling algorithms have been evaluated in simulation only. We compare simulations to real-world experiments with a view to identifying which features of a simulator are most critical for simulation accuracy. Our first finding is that simple yet popular analytical simulation models lead to simulation results that cannot be used for soundly comparing scheduling algorithms. We then show that, by contrast, simulation models instantiated based on brute-force measurements of the target execution environment lead to usable results. Finally, we develop empirical simulation models that provide a reasonable compromise between the two previous approaches.

HAL-IN2P3

Crossref

Low-latency XPath Query Evaluation on Multi-Core Processors

Author: Casanova Henri
Karsin Benjamin
Lim Lipyeow
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2017
Field of study

XML and the XPath querying language have become ubiquitous data and querying standards used in many industrial settings and across the World-Wide Web. The high latency of XPath queries over large XML databases remains a problem for many applications. While this latency could be reduced by parallel execution, issues such as work partitioning, memory contention, and load imbalance may diminish the benefits of parallelization. We propose three parallel XPath query engines: Static Work Partitioning, Work Queue, and Producer-Consumer-Hybrid. All three engines attempt to solve the issue of load imbalance while minimizing sequential execution time and overhead. We analyze their performance on sets of synthetic and real-world datasets. Results obtained on two multi-core platforms show that while load-balancing is easily achieved for most synthetic datasets, real-world datasets prove more challenging. Nevertheless, our Producer-Consumer-Hybrid query engine achieves good results across the board (speedup up to 6.31 on an 8-core platform).

Crossref

ScholarSpace at University of Hawai'i at Manoa

AIS Electronic Library (AISeL)

A Comparison of Scheduling Approaches for Mixed-Parallel Applications on Heterogeneous Platforms

Author: Casanova Henri
N'Takpé Tchimou
Suter Frédéric
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 05/07/2007
Field of study

Mixed-parallel applications can take advantage of large-scale computing platforms but scheduling them efficiently on such platforms is challenging. In this paper we compare the two main proposed approaches for solving this scheduling problem on a heterogeneous set of homogeneous clusters. We first modify previously proposed algorithms for both approaches and show that our modifications lead to significant improvements. We then perform a comparison of the modified algorithms in simulation over a wide range of application and platform conditions. We find that although both approaches have advantages, one of them is most likely the most appropriate for the majority of users.

Crossref

Toward More Scalable Off-Line Simulations of MPI Applications

Author: Casanova Henri
Gupta Anshul
Suter Frédéric
Publication venue: World Scientific Pub Co Pte Lt
Publication date: 01/09/2015
Field of study

The off-line (or post-mortem) analysis of execution event traces is a popular approach to understand the performance of HPC applications that use the message passing paradigm. Combining this analysis with simulation makes it possible to "replay" the application execution to explore "what if?" scenarios, e.g., assessing application performance in a range of (hypothetical) execution environments. However, such off-line analysis faces scalability issues for acquiring, storing, or replaying large event traces. We first present two previously proposed and complementary frameworks for off-line replaying of MPI application event traces, each with its own objectives and limitations. We then describe how these frameworks can be combined so as to capitalize on their respective strengths while alleviating several of their limitations. We claim that the combined framework affords levels of scalability that are beyond those achievable by either one of the two individual frameworks. We evaluate this framework to illustrate the benefits of the proposed combination for a more scalable off-line analysis of MPI applications.

HAL-IN2P3